Most research on task oriented dialog modeling is based on written text input. However, users interact with practical dialog systems often using speech as input. Typically, systems convert speech into text using an Automatic Speech Recognition (ASR) system, introducing errors. Furthermore, these systems do not address the differences in written and spoken language. The research on this topic is stymied by the lack of a public corpus. Motivated by these considerations, our goal in hosting the speech-aware dialog state tracking challenge was to create a public corpus or task which can be used to investigate the performance gap between the written and spoken forms of input, develop models that could alleviate this gap, and establish whether Text-to-Speech-based (TTS) systems is a reasonable surrogate to the more-labor intensive human data collection. We created three spoken versions of the popular written-domain MultiWoz task -- (a) TTS-Verbatim: written user inputs were converted into speech waveforms using a TTS system, (b) Human-Verbatim: humans spoke the user inputs verbatim, and (c) Human-paraphrased: humans paraphrased the user inputs. Additionally, we provided different forms of ASR output to encourage wider participation from teams that may not have access to state-of-the-art ASR systems. These included ASR transcripts, word time stamps, and latent representations of the audio (audio encoder outputs). In this paper, we describe the corpus, report results from participating teams, provide preliminary analyses of their results, and summarize the current state-of-the-art in this domain.
translated by 谷歌翻译
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants and only 50% of the participants performed ensembling based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
translated by 谷歌翻译
Denoising diffusion models hold great promise for generating diverse and realistic human motions. However, existing motion diffusion models largely disregard the laws of physics in the diffusion process and often generate physically-implausible motions with pronounced artifacts such as floating, foot sliding, and ground penetration. This seriously impacts the quality of generated motions and limits their real-world application. To address this issue, we present a novel physics-guided motion diffusion model (PhysDiff), which incorporates physical constraints into the diffusion process. Specifically, we propose a physics-based motion projection module that uses motion imitation in a physics simulator to project the denoised motion of a diffusion step to a physically-plausible motion. The projected motion is further used in the next diffusion step to guide the denoising diffusion process. Intuitively, the use of physics in our model iteratively pulls the motion toward a physically-plausible space. Experiments on large-scale human motion datasets show that our approach achieves state-of-the-art motion quality and improves physical plausibility drastically (>78% for all datasets).
translated by 谷歌翻译
The success of deep neural networks requires both high annotation quality and massive data. However, the size and the quality of a dataset are usually a trade-off in practice, as data collection and cleaning are expensive and time-consuming. Therefore, automatic noisy label detection (NLD) techniques are critical to real-world applications, especially those using crowdsourcing datasets. As this is an under-explored topic in automatic speaker verification (ASV), we present a simple but effective solution to the task. First, we compare the effectiveness of various commonly used metric learning loss functions under different noise settings. Then, we propose two ranking-based NLD methods, inter-class inconsistency and intra-class inconsistency ranking. They leverage the inconsistent nature of noisy labels and show high detection precision even under a high level of noise. Our solution gives rise to both efficient and effective cleaning of large-scale speaker recognition datasets.
translated by 谷歌翻译
An undirected weighted graph (UWG) is frequently adopted to describe the interactions among a solo set of nodes from real applications, such as the user contact frequency from a social network services system. A graph convolutional network (GCN) is widely adopted to perform representation learning to a UWG for subsequent pattern analysis tasks such as clustering or missing data estimation. However, existing GCNs mostly neglects the latent collaborative information hidden in its connected node pairs. To address this issue, this study proposes to model the node collaborations via a symmetric latent factor analysis model, and then regards it as a node-collaboration module for supplementing the collaboration loss in a GCN. Based on this idea, a Node-collaboration-informed Graph Convolutional Network (NGCN) is proposed with three-fold ideas: a) Learning latent collaborative information from the interaction of node pairs via a node-collaboration module; b) Building the residual connection and weighted representation propagation to obtain high representation capacity; and c) Implementing the model optimization in an end-to-end fashion to achieve precise representation to the target UWG. Empirical studies on UWGs emerging from real applications demonstrate that owing to its efficient incorporation of node-collaborations, the proposed NGCN significantly outperforms state-of-the-art GCNs in addressing the task of missing weight estimation. Meanwhile, its good scalability ensures its compatibility with more advanced GCN extensions, which will be further investigated in our future studies.
translated by 谷歌翻译
This paper focuses on analyzing and improving the commonsense ability of recent popular vision-language (VL) models. Despite the great success, we observe that existing VL-models still lack commonsense knowledge/reasoning ability (e.g., "Lemons are sour"), which is a vital component towards artificial general intelligence. Through our analysis, we find one important reason is that existing large-scale VL datasets do not contain much commonsense knowledge, which motivates us to improve the commonsense of VL-models from the data perspective. Rather than collecting a new VL training dataset, we propose a more scalable strategy, i.e., "Data Augmentation with kNowledge graph linearization for CommonsensE capability" (DANCE). It can be viewed as one type of data augmentation technique, which can inject commonsense knowledge into existing VL datasets on the fly during training. More specifically, we leverage the commonsense knowledge graph (e.g., ConceptNet) and create variants of text description in VL datasets via bidirectional sub-graph sequentialization. For better commonsense evaluation, we further propose the first retrieval-based commonsense diagnostic benchmark. By conducting extensive experiments on some representative VL-models, we demonstrate that our DANCE technique is able to significantly improve the commonsense ability while maintaining the performance on vanilla retrieval tasks. The code and data are available at https://github.com/pleaseconnectwifi/DANCE
translated by 谷歌翻译
Graph neural networks (GNNs) have been demonstrated to be a powerful algorithmic model in broad application fields for their effectiveness in learning over graphs. To scale GNN training up for large-scale and ever-growing graphs, the most promising solution is distributed training which distributes the workload of training across multiple computing nodes. However, the workflows, computational patterns, communication patterns, and optimization techniques of distributed GNN training remain preliminarily understood. In this paper, we provide a comprehensive survey of distributed GNN training by investigating various optimization techniques used in distributed GNN training. First, distributed GNN training is classified into several categories according to their workflows. In addition, their computational patterns and communication patterns, as well as the optimization techniques proposed by recent work are introduced. Second, the software frameworks and hardware platforms of distributed GNN training are also introduced for a deeper understanding. Third, distributed GNN training is compared with distributed training of deep neural networks, emphasizing the uniqueness of distributed GNN training. Finally, interesting issues and opportunities in this field are discussed.
translated by 谷歌翻译
视频亮点检测长期以来一直是计算机视觉任务中的主题,挖掘出未接触的原始视频输入的用户出现剪辑。但是,在大多数情况下,这一研究中的主流方法建立在封闭的世界假设上,在封闭的世界假设中,固定数量的突出显示类别是提前正确定义的,并且需要同时可用的所有培训数据,并且作为一个结果,相对于突出显示类别和数据集大小的可伸缩性差。为了解决上面提到的问题,我们提出了一个视频突出显示检测器,能够逐步学习,即\ textbf {g} lobal \ textbf {p} rototype \ textbf {e} ncoding(gpe),捕获新定义的视频亮点。通过其相应的原型扩展数据集。除此之外,我们提供了一个注释且昂贵的数据集,称为\ emph {Bytefood},包括超过5.1k的美食视频属于\ emph {cooke},\ emph {eat},\ emph {food Material},\ emph {cooke},和\ emph {演示}。据我们所知,这是第一次将增量学习设置引入视频突出显示检测,从而减轻培训视频输入的负担,并促进了按数据集的大小成比例的传统神经网络的可扩展性和域的数量。此外,所提出的GPE超过了\ emph {Bytefood}上的当前增量学习方法,至少报告了1.57 \%MAP的改善。代码和数据集将更早提供。
translated by 谷歌翻译
在动态环境中,持续增强学习(CRL)的关键挑战是,随着环境在其生命周期的变化,同时最大程度地减少对学习的信息的灾难性忘记,随着环境在其一生中的变化而变化。为了应对这一挑战,在本文中,我们提出了Dacorl,即动态自动持续RL。 Dacorl使用渐进式上下文化学习了上下文条件条件的策略,该策略会逐步将动态环境中的一系列固定任务群集成一系列上下文,并选择一个可扩展的多头神经网络以近似策略。具体来说,我们定义了一组具有类似动力学的任务,并将上下文推理形式化为在线贝叶斯无限高斯混合物集群的过程,这些过程是在环境特征上,诉诸在线贝叶斯推断,以推断上下文的后端分布。在以前的中国餐厅流程的假设下,该技术可以将当前任务准确地分类为先前看到的上下文,或者根据需要实例化新的上下文,而无需依靠任何外部指标来提前向环境变化发出信号。此外,我们采用了可扩展的多头神经网络,其输出层与新实例化的上下文同步扩展,以及一个知识蒸馏正规化项来保留学习任务的性能。作为一个可以与各种深度RL算法结合使用的一般框架,Dacorl在稳定性,整体性能和概括能力方面具有一致的优势,而不是现有方法,这是通过对几种机器人导航和Mujoco Socomotion任务进行的广泛实验来验证的。
translated by 谷歌翻译
本文回顾了AIM 2022上压缩图像和视频超级分辨率的挑战。这项挑战包括两条曲目。轨道1的目标是压缩图像的超分辨率,轨迹〜2靶向压缩视频的超分辨率。在轨道1中,我们使用流行的数据集DIV2K作为培训,验证和测试集。在轨道2中,我们提出了LDV 3.0数据集,其中包含365个视频,包括LDV 2.0数据集(335个视频)和30个其他视频。在这一挑战中,有12支球队和2支球队分别提交了赛道1和赛道2的最终结果。所提出的方法和解决方案衡量了压缩图像和视频上超分辨率的最先进。提出的LDV 3.0数据集可在https://github.com/renyang-home/ldv_dataset上找到。此挑战的首页是在https://github.com/renyang-home/aim22_compresssr。
translated by 谷歌翻译